Feature/remove large in clause in assets with cte and join #62114

Open

Nataneljpwd wants to merge 18 commits into apache:main from Nataneljpwd:feature/remove-large-in-clause-in-assets

Contributor

Nataneljpwd commented Feb 18, 2026

Closes: #61453
This PR fixes the large IN clause by using a CTE with a join rather than batching.
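As a rough illustration of the difference (not the PR's actual code), the sketch below uses the standard-library sqlite3 module with a simplified schema. The table layout, row counts, and the `id < 100` filter are assumptions made for the example. It contrasts a tuple IN query, which needs two bind parameters per asset, with a CTE joined on (name, uri), which needs none.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE asset (id INTEGER PRIMARY KEY, name TEXT, uri TEXT);
    CREATE TABLE asset_active (name TEXT, uri TEXT);
    """
)
rows = [(i, f"asset_{i}", f"s3://bucket/{i}") for i in range(200)]
conn.executemany("INSERT INTO asset VALUES (?, ?, ?)", rows)
conn.executemany(
    "INSERT INTO asset_active VALUES (?, ?)", [(n, u) for _, n, u in rows[:150]]
)

# Old shape: load the assets into Python, then send every (name, uri) pair
# back to the database as bind parameters of a tuple IN.
pairs = conn.execute("SELECT name, uri FROM asset WHERE id < 100").fetchall()
placeholders = ", ".join(["(?, ?)"] * len(pairs))
in_sql = (
    "SELECT name, uri FROM asset_active "
    f"WHERE (name, uri) IN (VALUES {placeholders})"
)
in_params = [value for pair in pairs for value in pair]
via_in = conn.execute(in_sql, in_params).fetchall()

# New shape: keep the asset selection in the database as a CTE and join on it.
cte_sql = """
    WITH assets_query AS (SELECT name, uri FROM asset WHERE id < 100)
    SELECT aa.name, aa.uri
    FROM asset_active AS aa
    JOIN assets_query AS q ON aa.name = q.name AND aa.uri = q.uri
"""
via_cte = conn.execute(cte_sql).fetchall()
```

Both queries return the same rows in this setup. The IN variant grows by two bind parameters for every asset, while the CTE variant's statement text stays constant.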

Was generative AI tooling used to co-author this PR?

Yes (please specify the tool below)
No


Natanel Rudyuklakir added 2 commits

February 16, 2026 22:15


          fixed large in clause

43478b1


          fixed tests

889db8c

Nataneljpwd requested review from XD-DENG and ashb as code owners

February 18, 2026 08:51

boring-cyborg bot added the area:Scheduler label


          merged

9110b93

Nataneljpwd force-pushed the feature/remove-large-in-clause-in-assets branch from e2000b8 to 9110b93 Compare

February 18, 2026 11:41

Natanel Rudyuklakir added 3 commits

February 18, 2026 22:00


          changed to delete using

6b47335


          Merge branch 'main' of https://github.com/apache/airflow into feature/remove-large-in-clause-in-assets

e70ddbb


          added compatibility for tests

db52248

Nataneljpwd marked this pull request as draft

February 18, 2026 20:11

Nataneljpwd and others added 3 commits

February 18, 2026 22:33


          Change asset selection to use CTE

493d0bd


          Fix asset selection for Airflow versions

557b3c4


          fixed some tests and optimized the query

cbe568c

Nataneljpwd mentioned this pull request

Avoid large tuple IN query in SchedulerJobRunner._activate_referenced_assets on PostgreSQL (performance / perceived hanging) #61453

Open

2 tasks


          fixed all tests

4c8adc3

Nataneljpwd marked this pull request as ready for review

February 24, 2026 07:16

Natanel Rudyuklakir added 2 commits

February 24, 2026 21:47


          Merge branch 'main' of https://github.com/apache/airflow into feature/remove-large-in-clause-in-assets

129fcad


          fixed mypy

ebc08dd

Nataneljpwd force-pushed the feature/remove-large-in-clause-in-assets branch from 2192078 to ebc08dd Compare

February 24, 2026 20:20

Asquator suggested changes


Asquator left a comment

Nice improvement. IN clauses should be avoided wherever possible when the values already reside in the DB.

airflow-core/src/airflow/jobs/scheduler_job_runner.py (outdated, resolved)

airflow-core/src/airflow/jobs/scheduler_job_runner.py

Comment on lines +3025 to -3037

+                          select(AssetModel)
                           .outerjoin(DagScheduleAssetReference)
                           .outerjoin(TaskOutletAssetReference)
                           .outerjoin(TaskInletAssetReference)
                           .group_by(AssetModel.id)
-                          .order_by(orphaned)

Asquator Feb 25, 2026

Did you consider extracting this to a helper function in asset.py, like many others located there?

Contributor Author

Nataneljpwd Feb 26, 2026

I do not see a benefit in doing so, which is why I left it as is. Do you have a specific reason for the request? I might have missed something.

airflow-core/src/airflow/jobs/scheduler_job_runner.py

Comment on lines +3052 to +3054

+                      active_assets_query = select(AssetActive.name, AssetActive.uri).join(
+                          assets_query,
+                          and_(AssetActive.name == assets_query.c.name, AssetActive.uri == assets_query.c.uri),

Asquator Feb 25, 2026

Can this be a helper function in asset.py too?
Just to avoid adding even more logic into the scheduler.

Contributor Author

Nataneljpwd Feb 26, 2026

Same as above.

airflow-core/src/airflow/jobs/scheduler_job_runner.py (outdated, resolved)

Asquator reviewed


airflow-core/src/airflow/jobs/scheduler_job_runner.py

+                          and_(AssetActive.name == assets_query.c.name, AssetActive.uri == assets_query.c.uri),
                       )
+                      active_assets = session.execute(active_assets_query).all()

Asquator Feb 25, 2026

If there are users with thousands of active assets, I wonder if this may explode one day.

Contributor Author

Nataneljpwd Feb 26, 2026

It is a good point, though it may be out of scope for this PR. I might open a follow-up PR to handle large scale, for example with batching. As of now it is not an issue, so I will leave it as is.
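One possible shape for such a follow-up, sketched with the standard-library sqlite3 module rather than the scheduler's SQLAlchemy session: read the join result in fixed-size chunks instead of materializing everything with `.all()`. The schema, the batch size, and the `iter_active_assets` helper are all hypothetical.

```python
import sqlite3
from typing import Iterator, List, Tuple


def iter_active_assets(
    conn: sqlite3.Connection, batch_size: int = 500
) -> Iterator[List[Tuple[str, str]]]:
    """Yield (name, uri) rows in chunks so memory use stays bounded."""
    cursor = conn.execute(
        """
        WITH assets_query AS (SELECT name, uri FROM asset)
        SELECT aa.name, aa.uri
        FROM asset_active AS aa
        JOIN assets_query AS q ON aa.name = q.name AND aa.uri = q.uri
        """
    )
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            return
        yield batch


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE asset (id INTEGER PRIMARY KEY, name TEXT, uri TEXT);
    CREATE TABLE asset_active (name TEXT, uri TEXT);
    """
)
rows = [(i, f"asset_{i}", f"s3://bucket/{i}") for i in range(1200)]
conn.executemany("INSERT INTO asset VALUES (?, ?, ?)", rows)
conn.executemany(
    "INSERT INTO asset_active VALUES (?, ?)", [(n, u) for _, n, u in rows]
)

batch_sizes = [len(batch) for batch in iter_active_assets(conn, batch_size=500)]
# batch_sizes is [500, 500, 200]
```

With a SQLAlchemy session, `Result.yield_per()` or `Result.partitions()` would likely play the role that `fetchmany` plays here.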

Asquator reviewed


airflow-core/src/airflow/jobs/scheduler_job_runner.py

    
                          session.execute(
                              delete(AssetActive).where(
                                  tuple_(AssetActive.name, AssetActive.uri).in_((a.name, a.uri) for a in assets)

                  def _orphan_unreferenced_assets(assets_query: CTE, *, session: Session) -> None:

Asquator Feb 25, 2026

Maybe we can avoid passing a CTE as an argument (which is not intuitive) by using the helper function.

Contributor Author

Nataneljpwd Feb 26, 2026

What do you suggest, then?
This approach had the least duplicated code. If you have any suggestions, I would be happy to hear them.

Asquator Feb 26, 2026

asset_reference_query is a static query that never changes. If it's referenced in two places, maybe it's worth extracting it as a helper function, again?

Asquator Feb 26, 2026

This way we won't be passing CTEs as method parameters

Contributor Author

Nataneljpwd Feb 28, 2026

In my opinion, it is harder to track that way.

It is much simpler to see a query being passed in than to jump to a different method.

Asquator Mar 1, 2026

I don't think it's harder to track. As it's a constant, reusable CTE, I would put it as a cached util function in the corresponding module instead of generating it in the scheduler code.
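To make the two positions concrete, here is a minimal sketch of the helper-function variant, using plain SQL strings and sqlite3 in place of SQLAlchemy constructs. Every name in it (`unreferenced_assets_cte`, `find_orphan_candidates`, `orphan_unreferenced_assets`, the simplified tables) is hypothetical. In the PR, the CTE is a SQLAlchemy object built in the scheduler and passed to `_orphan_unreferenced_assets`.

```python
import sqlite3
from functools import lru_cache


@lru_cache(maxsize=None)
def unreferenced_assets_cte() -> str:
    """Return the static CTE text; later calls reuse the cached value.

    Caching a string is cheap either way; the pattern matters more when the
    cached object is a query construct that is costly to rebuild.
    """
    return (
        "WITH unreferenced AS ("
        " SELECT a.name, a.uri FROM asset AS a"
        " LEFT JOIN asset_reference AS r ON r.asset_id = a.id"
        " GROUP BY a.id HAVING COUNT(r.asset_id) = 0"
        ") "
    )


def find_orphan_candidates(conn: sqlite3.Connection) -> list:
    # First caller: no CTE parameter, the helper supplies it.
    sql = unreferenced_assets_cte() + "SELECT name, uri FROM unreferenced ORDER BY name"
    return conn.execute(sql).fetchall()


def orphan_unreferenced_assets(conn: sqlite3.Connection) -> None:
    # Second caller reuses the same helper instead of receiving the CTE.
    sql = unreferenced_assets_cte() + (
        "DELETE FROM asset_active WHERE EXISTS ("
        " SELECT 1 FROM unreferenced AS u"
        " WHERE u.name = asset_active.name AND u.uri = asset_active.uri)"
    )
    conn.execute(sql)


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE asset (id INTEGER PRIMARY KEY, name TEXT, uri TEXT);
    CREATE TABLE asset_reference (asset_id INTEGER);
    CREATE TABLE asset_active (name TEXT, uri TEXT);
    INSERT INTO asset VALUES (1, 'a', 'uri_a'), (2, 'b', 'uri_b'),
                             (3, 'c', 'uri_c'), (4, 'd', 'uri_d');
    INSERT INTO asset_reference VALUES (1), (2);
    INSERT INTO asset_active SELECT name, uri FROM asset;
    """
)

candidates = find_orphan_candidates(conn)
orphan_unreferenced_assets(conn)
remaining = conn.execute("SELECT name, uri FROM asset_active ORDER BY name").fetchall()
```

The trade-off raised in the thread remains: the helper removes the CTE parameter, at the cost of readers having to look up the query in another function.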

Natanel Rudyuklakir added 2 commits

February 26, 2026 18:07


          fixup! address cr comments

aeb97b7


          Merge branch 'main' of https://github.com/apache/airflow into feature/remove-large-in-clause-in-assets

ff0347f

Nataneljpwd force-pushed the feature/remove-large-in-clause-in-assets branch from d8e235d to ff0347f Compare

February 26, 2026 16:08

kaxil reviewed


Member

kaxil left a comment

Thanks for pushing this — the SQL-shape change is in the right direction for #61453. I left two inline comments for follow-up before merge.

devel-common/src/tests_common/pytest_plugin.py Outdated

+                          assets = select(AssetModel).where(assets_select_condition).cte()
+                          if not AIRFLOW_V_3_2_PLUS:
+                              assets = self.session.scalars(select(assets)).all()

Member

kaxil Feb 27, 2026

For Airflow <3.2 this fallback currently uses scalars(select(assets)) where assets is a CTE built from select(AssetModel). scalars() returns only the first selected column, so this becomes a list of IDs (not AssetModel objects). That can break _activate_referenced_assets when it expects .name / .uri. Could we keep the old pre-3.2 materialization query, or join the CTE back to AssetModel before calling scalars()?

Contributor Author

Nataneljpwd Feb 28, 2026

Sure, I will join back to the asset model

airflow-core/tests/unit/jobs/test_scheduler_job.py

    
                      asset_models = session.scalars(select(AssetModel)).all()
                      assert len(asset_models) == 3
                      asset_models = select(AssetModel).cte()

Member

kaxil Feb 27, 2026

Would you add an explicit regression assertion for #61453's failure mode (large tuple-IN bind expansion)? These tests now validate behavior with a CTE input, but they don't directly guard against reintroducing a huge (name, uri) IN (...) path in scheduler asset activation.

Contributor Author

Nataneljpwd Feb 28, 2026

How do you think this could be added? Using IN does not cause a failure, only a slowdown.

The only thing I can think of is to check for the keyword 'in' in the string form of the query.

Contributor Author

Nataneljpwd Feb 28, 2026

Found a way to make it work with SQLAlchemy event listeners, and added the test.
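The test itself is not shown in this thread. As a rough stand-in for the idea, the sketch below uses sqlite3's `set_trace_callback`, a standard-library analogue of a SQLAlchemy `before_cursor_execute` listener, to record every executed statement and flag any that contain an `IN (...)` list. The schema and the `find_in_clauses`, `activate_with_in`, and `activate_with_cte` helpers are hypothetical.

```python
import re
import sqlite3
from typing import Callable, List

# Matches the keyword IN followed by an opening parenthesis.
IN_CLAUSE = re.compile(r"\bIN\s*\(", re.IGNORECASE)


def find_in_clauses(
    conn: sqlite3.Connection, run: Callable[[sqlite3.Connection], None]
) -> List[str]:
    """Run `run(conn)` while tracing SQL; return statements that use IN (...)."""
    seen: List[str] = []
    conn.set_trace_callback(seen.append)
    try:
        run(conn)
    finally:
        conn.set_trace_callback(None)
    return [sql for sql in seen if IN_CLAUSE.search(sql)]


def activate_with_in(conn: sqlite3.Connection) -> None:
    # Old shape: tuple IN with two bind parameters per asset.
    pairs = conn.execute("SELECT name, uri FROM asset").fetchall()
    placeholders = ", ".join(["(?, ?)"] * len(pairs))
    params = [value for pair in pairs for value in pair]
    conn.execute(
        "SELECT name, uri FROM asset_active "
        f"WHERE (name, uri) IN (VALUES {placeholders})",
        params,
    ).fetchall()


def activate_with_cte(conn: sqlite3.Connection) -> None:
    # New shape: CTE joined on (name, uri), no IN list at all.
    conn.execute(
        """
        WITH assets_query AS (SELECT name, uri FROM asset)
        SELECT aa.name, aa.uri
        FROM asset_active AS aa
        JOIN assets_query AS q ON aa.name = q.name AND aa.uri = q.uri
        """
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE asset (id INTEGER PRIMARY KEY, name TEXT, uri TEXT);
    CREATE TABLE asset_active (name TEXT, uri TEXT);
    """
)
rows = [(i, f"asset_{i}", f"s3://bucket/{i}") for i in range(50)]
conn.executemany("INSERT INTO asset VALUES (?, ?, ?)", rows)
conn.executemany(
    "INSERT INTO asset_active VALUES (?, ?)", [(n, u) for _, n, u in rows]
)

offending_old = find_in_clauses(conn, activate_with_in)
offending_new = find_in_clauses(conn, activate_with_cte)
```

A regression test along these lines would assert that the list of offending statements is empty for the code path under test.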


          Merge branch 'main' of https://github.com/apache/airflow into feature/remove-large-in-clause-in-assets

7128ccf

Nataneljpwd requested a review from Asquator

February 28, 2026 13:35


          address CR comments, added tests and fixed plugin

Nataneljpwd requested a review from kaxil

February 28, 2026 14:31

Contributor Author

Nataneljpwd commented Feb 28, 2026

Hello @kaxil, I have addressed the comments and would appreciate a review.

Nataneljpwd added 2 commits

February 28, 2026 16:32


          Merge branch 'main' into feature/remove-large-in-clause-in-assets

52a9c16


          Merge branch 'main' into feature/remove-large-in-clause-in-assets

31a6d00


Reviewers

ashb: awaiting requested review (code owner)

XD-DENG: awaiting requested review (code owner)

Asquator: awaiting requested review

kaxil: awaiting requested review

At least 1 approving review is required to merge this pull request.

Labels